MEDB 5501, Module11
2025-11-04
Topics to be covered
- What you will learn
- Analysis of variance is linear regression
- R code for analysis of variance
- Log transformation
- R code for analysis of variance
- Kruskal-Wallis test
- R code for analysis of variance
- Your homework
Indicator variables, 1
- Two levels is less complex with three or more levels
- R assigns 0 to first category (alphabetically)
- R assigns 1 to second category
- Examples:
- Female=0, Male=1
- Man=0, Woman=1
- Interpretation
- Intercept is estimated average outcome for first category
- Slope is the estimated average change
- Second category average minus first category average
Example: Turtle experiment, 1
Sex Fed Fasted10 Fasted20
Male 42.8 42.4 38.9
Male 43.1 42.2 40.3
Male 40.4 40.8 37.5
Male 46.6 45.9 42.9
Female 42.2 42.4 39.7
Female 38.7 38.1 35.8
Female 35.3 34.3 32.3
Female 40.5 40.1 37.3
Turtle experiment, 2
Sex Fed Fasted10 Fasted20 Indicator
Male 42.8 42.4 38.9 1
Male 43.1 42.2 40.3 1
Male 40.4 40.8 37.5 1
Male 46.6 45.9 42.9 1
Female 42.2 42.4 39.7 0
Female 38.7 38.1 35.8 0
Female 35.3 34.3 32.3 0
Female 40.5 40.1 37.3 0
Indicator variables, 2
- With k levels, you need k-1 indicators
- First indicator
- R assigns 1 to second category (alphabetically)
- R assigns 0 to all other categories
- Second indicator
- R assigns 1 to third category (alphabetically)
- R assigns 0 to all other categories
- And so on
- Note: the first category in alphabetical order is zero for all indicator variables.
Example with dietary cracker data
Cracker Digested
control 1772.84
bran 1752.63
combo 2121.97
gum 2558.61
gum 2026.91
bran 2047.42
combo 2254.75
control 2353.21
combo 2153.36
gum 2331.19
bran 2547.77
...
Dietary cracker data, first indicator variable
Cracker Digested i1
control 1772.84 0
bran 1752.63 0
combo 2121.97 1
gum 2558.61 0
gum 2026.91 0
bran 2047.42 0
combo 2254.75 1
control 2353.21 0
combo 2153.36 1
gum 2331.19 0
bran 2547.77 0
...
Dietary cracker data, second indicator variable
Cracker Digested i1 i2
control 1772.84 0 1
bran 1752.63 0 0
combo 2121.97 1 0
gum 2558.61 0 0
gum 2026.91 0 0
bran 2047.42 0 0
combo 2254.75 1 0
control 2353.21 0 1
combo 2153.36 1 0
gum 2331.19 0 0
bran 2547.77 0 0
...
Dietary cracker data, third indicator variable
Cracker Digested i1 i2 i3
control 1772.84 0 1 0
bran 1752.63 0 0 0
combo 2121.97 1 0 0
gum 2558.61 0 0 1
gum 2026.91 0 0 1
bran 2047.42 0 0 0
combo 2254.75 1 0 0
control 2353.21 0 1 0
combo 2153.36 1 0 0
gum 2331.19 0 0 1
bran 2547.77 0 0 0
...
Interpretation with multiple indicator variables
- Intercept is estimated average outcome for first category
- First slope is the estimated change
- Second category average minus first category average
- Second slope is the estimated change
- Third category average minus first category average
- And so on
What if you want a different reference category?
Create your own indicator variables
dietary |>
mutate(i1=as.numeric(Cracker=="bran")) |>
mutate(i2=as.numeric(Cracker=="combo")) |>
mutate(i3=as.numeric(Cracker=="gum")) -> dietary_1
or revise the order using the factor function.
new_order <- c("control", "bran", "combo", "gum")
dietary |>
mutate(Cracker=factor(Cracker, levels=new_order)) -> dietary_2
lm()
Example using fruitfly lifespans, 1
- Experiment with 125 cages
- Does fruitfly mating affect average male lifespan?
- Isolate a male fruitfly with
- one or eight virgin females,
- one or eight pregnant females, or
- no females
- Note: males will not mate with pregnant females
Example using fruitfly lifespans, 2
- Cages 1-25 have one male fruitfly, no female fruit flies
- Cages 26-50 have one male fruitfly, one pregnant female
- Cages 51-75 have one male fruitfly, one virgin female
- Cages 76-100 have one male fruitfly, eight pregnant females
- Cages 101-125 have one male fruitfly, eight virgin females
Fruitfly lifespan boxplots
Fruitfly lifespan group means
# A tibble: 5 × 4
cage longevity_mn longevity_sd n
<chr> <dbl> <dbl> <int>
1 Eight pregnant females 63.4 14.5 25
2 Eight virgin females 38.7 12.1 25
3 No females 63.6 16.5 25
4 One pregnant female 64.8 15.7 25
5 One virgin female 56.8 14.9 25
Break #1
- What you have learned
- Analysis of variance is linear regression
- What’s coming next
- R code for analysis of variance
Break #2
- What you have learned
- R code for analysis of variance
- What’s coming next
Analysis of variance model
- Sample 1: \(Y_{11}, Y_{12}, \cdots, Y_{1n_1}\) are \(N(\mu_1, \sigma)\)
- Sample 2: \(Y_{21}, Y_{22}, \cdots, Y_{2n_2}\) are \(N(\mu_2, \sigma)\)
- \(\cdots\)
- Sample k: \(Y_{k1}, Y_{k2}, \cdots, Y_{kn_k}\) are \(N(\mu_k, \sigma)\) \(\ \)
- \(H_0:\ \mu_1=\mu_2=...=\mu_k\)
- \(H_1:\ \mu_i \ne \mu_j\) for some i, j
Violation of assumptions
- Non-normality
- Heterogeneity
- Lack of independence
How logarithms convert multiplication and division
- \(log(a \times b)=log(a)+log(b)\)
- \(log(a / b)=log(a)-log(b)\)
The slide rule uses this property
The book of logarithms
Break #3
- What you have learned
- What’s coming next
- R code for analysis of variance
Break #4
- What you have learned
- R code for analysis of variance
- What’s coming next
The Chi-squared distribution
- Often denoted as \(\chi^2_{df}\)
- Has a single parameter, degrees of freedom
- Never negative
- Skewed right
- Mean equals degrees of freedom
- Calculations in R
- pchisq(x, df) = \(P[\chi^2_{df} < x]\)
- qchisq(p, df) = \(p^{th}\) quantile, \(\chi^2_{df, p}\)
The Chi-squared distribution
Computing the Kruskal-Wallis test
- Rank the observations, \(R(X_{ij})\)
- 1 for smallest, 2 for second smallest, etc.
- Compute the average rank in each group, \(\bar{R}_i\)
- Compute the overall rank, \(\bar{R}\)
- T = \((N-1)\frac{\Sigma n_i(\bar{R}_i - \bar{R})^2}{\Sigma \Sigma(R(X_{ij})-\bar{R})^2}\)
Decision rule for Kruskal-Wallis test, 1
- Accept \(H_0\) if T <- \(\chi^2_{df, 1-\alpha}\)
- Accept \(H_0\) if p-value > \(\alpha\)
- p-value = \(P[\chi^2_{df} > T]\)
Decision rule for Kruskal-Wallis test, 1
- Null hypothesis is difficult to define
- Does not involve population means
- Some claim involvement of population medians
- Stochastic dominance
- \(P[X_{aj} > X_{bj}] > 0.5\) for some a and b.
Application of Kruskal-Wallis test to fruitfly longevity
kruskal.test(longevity ~ cage, data=fly_1)
Kruskal-Wallis rank sum test
data: longevity by cage
Kruskal-Wallis chi-squared = 37.961, df = 4, p-value = 1.142e-07
Break #5
- What you have learned
- What’s coming next
- R code for analysis of variance
Break #6
- What you have learned
- R code for analysis of variance
- What’s coming next
Summary
- What you have learned
- Analysis of variance is linear regression
- R code for analysis of variance
- Log transformation
- R code for analysis of variance
- Kruskal-Wallis test
- R code for analysis of variance
- Your homework